{⊂C;<N;αVISION THEORY.;λ30;P68;I425,0;JCFA} SECTION 6.
{JCFD}                   COMPUTER VISION THEORY.
{λ10;W250;JAFA}
	6.0	Introduction to Computer Vision Theory.
	6.1	A Geometric Feedback Vision System.
	6.2	Vision Tasks.
	6.3	Vision System Design Arguments.
	6.4	Mobile Robot Vision.
	6.5	Summary and Related Vision Work.

{λ30;W0;I900,0;JUFA}
⊂6.0	Introduction to Computer Vision Theory.⊃

	Computer vision concerns programming a computer  to do a task
that  demands the  use of  an image  forming light  sensor such  as a
television camera.  The theory I intend to elaborate is  that general
3-D  vision is  a continuous  process of  keeping an  internal visual
simulator  in sync with  perceived images of the  external reality, so
that vision tasks  can be done more  by reference to the  simulator's
model  and  less  by  reference  to  the original  images.  The  word
<theory>, as used here,  means simply a set of statements  presenting
a systematic view of  a subject; specifically, I wish  to exclude the
connotation that  the theory is a natural  theory of vision. Perhaps
there can be  such a thing  as an  <artificial theory> which  extends
from the philosophy thru the design of an artifact.

⊂6.1	A Geometric Feedback Vision System.⊃

	Vision systems mediate between images and world models; these
two  extremes  of a  vision system  are  called, in  the  jargon, the
<bottom> and  the <top> respectively.   In  what follows,   the  word
<image> will be used  to refer to the notion of  a 2-D data structure
representing  a picture; a  picture being a rectangle  taken from the
pattern  of  light  formed  by  a  thin  lens  on   the  nearly  flat
photoelectric surface of a  television camera's vidicon. On the other
hand, a  <world model>  is  a data  structure  which is  supposed  to
represent the physical world for the purposes of a task processor. In
particular,  the  main  point of  this  thesis  concerns isolating  a
portion of the world model (called the 3-D geometric world model) and
placing it below most of the other entities that a task processor has
to deal with.  The vision hierarchy, so formed,  is illustrated in box 6.1.
{|λ10;JA}
BOX 6.1 {JC} VISION SYSTEM HIERARCHY.

{JC} Task Processor
{JC} |
{JC} Task World Model
		 The  Top  → {JC} |
{JC} 3-D Geometric Model
{JC} |
		 The Bottom → {JC} 2-D Images
{|λ30;JU}
	Between the top and  the bottom, between images and  the task
world model,  a general vision system has three distinguishable modes
of operation: recognition,  verification and description. Recognition
vision can be characterized as bottom up: what is in the picture is
determined by extracting a set of features from the image and by
classifying them with respect to prejudices which must be taught.
Verification vision is top  down or model driven vision, and involves
predicting an image followed by  comparing the predicted image and  a
perceived  image for  differences  which  are  expected but  not  yet
measured. Descriptive vision is bottom  up or data  driven vision and
involves converting  the image into  a representation  that makes  it
possible (or easier) to do the desired  vision task.  I would like to
call  this  third  kind  of  vision  "revelation  vision"  at  times,
although the  phrase "descriptive vision"  is the  term used by  most
members of the computer vision community.
{|λ10;JU;FA}
Box 6.2 {JC} THREE BASIC MODES OF VISION.

	1. Recognition Vision - Feature Classification. (bottom up into a prejudiced top).
	2. Verification Vision - Model Driven Vision. (nearly pure top down vision).
	3. Descriptive Vision - Data Driven Vision. (nearly pure bottom up vision).
{|λ30;JU}
	There are now enough concepts to outline a feedback system.
By placing a 3-D geometric model between top and bottom, recognition
vision can be done by mapping 3-D (rather than 2-D) features into the
task world model, with descriptive vision and verification vision
linking the 2-D and 3-D models in a relatively dumb, mechanical
fashion.  Previous attempts to use recognition vision to bridge
directly the gap between 2-D images (of 3-D objects) and the task
world model have been frustrated because the characteristic 2-D image
features of a 3-D object are very dependent on the 3-D physical
processes of occultation, rotation and illumination.  It is these
processes that will have to be modeled and understood before the
features relevant to the task processor can be deduced from the
perceived images.  The arrangement of these elements is diagrammed
below.{|λ10;JA}
Box 6.3 {JC} BASIC FEEDBACK VISION SYSTEM DESIGN.

{JC} Task World Model
{JC} ↑
{JC} RECOGNITION
{JC} ↑
{JC} 3-D geometric model
{JC} ↑            ↓
{JC} DESCRIPTION        VERIFICATION
{JC} ↑            ↓
{JC} 2-D images
{|λ30;JU}
	The lower part of the above
diagram is the feedback loop of the 3-D geometric
vision system.  Depending on circumstances, the vision system may
run almost entirely top-down (verification vision) or
bottom-up (revelation vision).  Verification vision is all that is
required in a well known, predictable environment; whereas revelation
vision is required in a brand new (tabula rasa) or rapidly changing
environment.  Thus revelation and verification form a loop, bottom-up
and top-down.  First, there is revelation, which unprejudicially builds
a 3-D model; and second, the model is verified by testing image
features predicted from the model.  This loop-like structure
has been noted before by others; it is a form of what Tenenbaum (71)
called <accommodation> and a form of what Falk (69) called
<heuristic vision>; however, I will go along with what I think is the
current majority of vision workers who call it <feedback vision>.
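
	As an illustration only, this loop can be caricatured in a few
lines of Python; the routine names, the toy 2-D world of landmark
points, and the numbers below are hypothetical and are not routines
of the actual system.  Matched features verify and correct the camera
locus; unmatched features are revealed and added to the model.

import math

def predict(model, locus):
    # top down: predicted feature positions, given the model and a camera locus
    return [(x + locus[0], y + locus[1]) for (x, y) in model]

def match(predicted, perceived, tol=0.5):
    pairs, unmatched = [], []
    for q in perceived:
        near = min(predicted, key=lambda p: math.dist(p, q), default=None)
        if near is not None and math.dist(near, q) < tol:
            pairs.append((near, q))
        else:
            unmatched.append(q)
    return pairs, unmatched

def solve_locus(locus, pairs):
    # verification: correct the locus by the mean discrepancy of matched features
    dx = sum(q[0] - p[0] for p, q in pairs) / len(pairs)
    dy = sum(q[1] - p[1] for p, q in pairs) / len(pairs)
    return (locus[0] + dx, locus[1] + dy)

def reveal(model, locus, unmatched):
    # revelation: place unexplained perceived features back into the world model
    return model + [(x - locus[0], y - locus[1]) for (x, y) in unmatched]

model, locus = [(0.0, 0.0), (1.0, 0.0)], (5.0, 0.0)    # dead reckoned locus estimate
perceived = [(5.2, 0.0), (6.2, 0.0), (7.2, 0.0)]       # the true locus is (5.2, 0.0)
pairs, new = match(predict(model, locus), perceived)
locus = solve_locus(locus, pairs)                      # corrected to (5.2, 0.0)
model = reveal(model, locus, new)                      # a third landmark appears at (2.0, 0.0)
print(locus, model)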

	Completing  the   design,     the  images   and  worlds   are
constructed, manipulated and compared by a variety of processors, the
topmost of which is the  task processor. Since the task processor  is
expected to  vary with the application,  it would be expedient  if it
could  be isolated as a  user  program that calls on utility routines
of  an appropriate  vision  sub-system.  Immediately below  the  task
processor  are the  3-D  recognition routines  and  the 3-D  modeling
routines.  The modeling routines underlie nearly everything because
they are used to create, alter and access the models.{
|;λ10;JAFA}
Box 6.4	{JC} PROCESSORS OF A 3-D VISION SYSTEM.
{↓}	
	0. The task processor.
	1. 3-D recognition.
	2. 3-D modeling  routines.
	3. Reality simulator.
{↑;W560;}
	4. Image analyser.
	5. Image synthesizer.
	6. Locus solvers.
	7. Comparators: 2D and 3D.
{|;λ30;JUFA}
	The remaining processors include the  reality simulator which
does mechanics for modeling motion, collision and gravity.
Also there  are image  analyzers,   which  do image  enhancement  and
conversions  such as  converting  video rasters  into line  drawings.
There  is an  image synthesizer, which  does hidden  line and surface
elimination, for verification by comparing synthetic  images from the
model  with perceived  images of  reality. There  are three  kinds of
locus solvers that compute numerical descriptions for cameras,  light
sources and physical objects.  Finally, there are of course a large
number of (at least ten) different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.
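
	To fix ideas, a minimal sketch of the image synthesizer's
innermost step is given below (in Python, with hypothetical names and
toy numbers): perspective projection of 3-D model points into a
predicted image, with a tiny depth buffer standing in for hidden
surface elimination.  The actual synthesizer works on full models and
does hidden line and surface elimination; this point-wise version is
only a caricature.

def synthesize(points, focal=100.0, size=64):
    depth = {}                                   # (u, v) -> nearest z seen so far
    for (x, y, z) in points:
        if z <= 0.0:
            continue                             # behind the camera
        u = int(size / 2 + focal * x / z)        # perspective projection
        v = int(size / 2 + focal * y / z)
        if 0 <= u < size and 0 <= v < size:
            if (u, v) not in depth or z < depth[(u, v)]:
                depth[(u, v)] = z                # the nearer point hides the farther one
    return depth

# two model points along the same ray; only the nearer one survives
print(synthesize([(0.2, 0.0, 2.0), (0.4, 0.0, 4.0)]))   # {(42, 32): 2.0}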

⊂6.2	Vision Tasks.⊃

	The 3-D  vision research problem  being discussed is  that of
finding  out how to  write programs that  can see in  the real world.
Related vision problems  include: modeling  human
perception,  solving visual  puzzles (non-real world), and developing
advanced automation techniques (ad hoc vision).  In order to approach
the problem, specific programming tasks are proposed and solutions
are sought; however, a programming task is different from a research
problem because many vision
tasks can be done without vision.  The vision solution to be found
should  be  able  to  deal with  real  images,    should include  the
continuity of the  visual process in  time and  space, and should  be
more general  purpose and less ad hoc.    These  three  requirements
(reality,   continuity, and generality) will be  developed by surveying
six examples of computer vision tasks.{Q}
{|;λ10;JAFA}
BOX 6.5{JC}	SIX EXAMPLES OF COMPUTER VISION TASKS.
{↓}
<Cart Related Tasks>.
	1. The Chauffeur Task.
	2. The Explorer Task.
	3. The Soldier Task.
{↑;W650;}
<Table Top Related Tasks>.
	4. Turntable Task.
	5. The Blocks Task.
	6. Machine Assembly Tasks.
{|;λ30;JUFA}
	First, there  is the  robot chauffeur  task.   In 1969,   John
McCarthy asked  me to consider the vision  requirements of a computer
controlled car such as he depicted in an unpublished essay.  The idea
is that a user of such  an automatic car would request a destination;
the  robot would select a  route from an  internally stored road map;
and it would then proceed to its destination using  visual data.  The
problem  involves  representing the  road  map  in  the computer  and
establishing the correspondence between the map and the appearance of
the road  as  the automatic  chauffeur drives  the  vehicle along  the
selected route.   Lacking a computer controlled car,  the problem was
abstracted to that of tracing a route along the driveways and parking
lots that  surround the Stanford  A.I. Laboratory using  a television
camera  and transmitter mounted on a  radio controlled electric cart.
The robot chauffeur task could  be solved by non-visual means  such as
by railroad-like guidance or by inertial guidance; to preserve the
vision aspect  of the  problem,   no particular  artifacts should  be
required along a route (landmarks must be found, not placed); and the
extent of inertial dead reckoning should be noted.

	Second,  there is the task of a robot explorer.  In (McCarthy
1964) there is a description of a robot for exploring Mars. The robot
explorer was required  to run for long periods of  time without human
intervention because the signal transmission time to Mars is as great
as twenty minutes and because the 24.6 hour Martian day would place
the  vehicle out of  Earth sight for  twelve hours at a time.   (This
latter difficulty could be avoided at the expense of having a set  of
communication relay satellites in orbit around Mars.) The task of the
explorer  would be to drive  around mapping the  surface, looking for
interesting features,  and doing various experiments.  To be prudent,
a Mars explorer  should be able to navigate without  vision; this can
be  done  by driving  slowly  and by  using a  tactile  collision and
crevasse detector.  If the television system fails,  the core samples
and so on  can still be collected at  different Martian sites without
unusual risk to the vehicle due to visual blindness.

	The third vision  task is that  of the robot soldier,   tank,
sentry, pilot or  policeman.  The problem has several forms which are
quite similar to the chauffeur  and the explorer with the  additional
goal of doing something to coerce  an opponent.  Although this vision
task has  not yet been explicitly attempted at Stanford,  to the best
of my knowledge, the reader should be warned that a thorough solution
to any of the other  tasks almost assures the Orwellian technology to
solve this one.

	Fourth, the turntable task is to construct a 3-D model  from
a sequence of 2-D  television images taken of an object  rotated on a
turntable.   The turntable task was  selected as a simplification of
the explorer  task and  is an  example of  a  nearly pure  descriptive
vision task.

	Fifth, the classic blocks vision  task consists of two parts:
first  convert a  video image into  a line  drawing; second,   make a
selection from a  set of predefined  prototype models of blocks  that
accounts  for the line  drawing.   In my opinion,   this  vision task
emphasizes three pitfalls:  single image vision,   line drawings  and
blocks. The greatest pitfall, in the usual blocks vision task, is the
presumption  that a  single  image is  to be  solved;  thus diverting
attention  away  from  the   two  most  important  depth   perception
mechanisms which are motion parallax  and stereo parallax. The second
pitfall is that the usual notion of a perspective line drawing is not
a natural intermediate state; but is rather a  very sophisticated and
platonic geometric  idea. The perfect line  drawing lacks photometric
information; even a line drawing  with perfect shadow lines  included
will not resemble anything  that can readily be gotten  by processing
real television pictures.  Curiously, the lack of success in deriving
line drawings  from real  television images  of real  blocks has  not
dampened  interest in  solving the  second part  of the  problem. The
perfect line drawing puzzle was first worked on by Guzman (68) and
extended to perfect shadows by Waltz (72); nevertheless, enough remains so
that the puzzle will persist on its own merits, without being
closely relevant to real world computer vision.  Even assuming that
imperfect line drawings are given, the blocks themselves have
led such researchers as Falk (69) and Grape (73) to concentrate on vertex/edge
classification schemes which have not been extended beyond the blocks
domain.  The blocks task could be rehabilitated by concentrating on
photometric modeling and the use of multiple images for depth
perception.
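
	As a concrete instance of the stereo parallax appealed to
above: for two camera stations a baseline B apart, with focal length
f, a point whose two images differ by a disparity d lies at a depth
of roughly Z = f*B/d.  The few lines of Python below, with purely
illustrative numbers, are only a reminder of that relation.

def stereo_depth(focal, baseline, disparity):
    # Z = f * B / d, all lengths in the same units
    return focal * baseline / disparity

# e.g. a 25 mm lens, camera stations 100 mm apart, a measured disparity of 0.5 mm
print(stereo_depth(25.0, 100.0, 0.5))            # 5000.0 mm: the point is about 5 meters away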

	Sixth,  the Stanford  Artificial  Intelligence Laboratory  has
recently  (1974) begun  work on a  National Science  Foundation Grant
supporting research in  automatic machine  assembly.  In  particular,
effort  will  be  directed  to  developing  techniques  that  can  be
demonstrated  by  automatically  assembling  a  chain  saw  gasoline
engine.  Two vision questions in such a machine assembly task are:
where is the part, and where is the hole?  These questions will be
initially handled by composing ad hoc part and hole detectors for
each vision step required for the assembly.

	The point of this  task survey was to illustrate what  is and is
not a  task requiring real 3-D vision; and  to point out that caution
has to be taken  to preserve the vision aspects  of a given task.  In
the usual course of vision projects, a single task or a single tool
unfortunately dominates the research; my work is no exception: the
one tool is 3-D modeling, and the task that dominated the formative
stages of the research is that of the robot chauffeured cart.  A
better understanding of the ultimate nature of computer vision can be
obtained by keeping the several tasks and the several tools in mind.

⊂6.3	Vision System Design Arguments.⊃

	The physical information most directly  relevant to vision is
the location,  extent and light scattering properties of solid opaque
objects; the location,   orientation  and projection of  the camera  that
takes the  pictures; and the  location and  nature of the  light that
illuminates  the world.    The transformation  rules of  the everyday
world that  a  programmer  may assume,  a  priori,  are the  laws  of
physics.  The arguments against  geometric modeling divide
into two categories: the reasonable and the intuitive.
The reasonable arguments attack 3-D geometric modeling by
comparing it to other modeling alternatives, some of which are
listed in Box 6.6.  Actually, the domains
of  efficiency of  the  possible  kinds  of models  do  not
greatly overlap;  and an artificial intellect  will have some portion of
each  kind.  Nevertheless, I  feel  that  3-D  geometric  modeling  is
superior for  the task at  hand, and that  the other models  are less
relevant to vision.{Q}
{|;λ10;JAFA}
BOX 6.6{JVJC} Alternatives to 3-D Geometric Modeling in a Vision System.{I∂20,0;}
		1. Image memory, with only the camera model in 3-D.
		2. Statistical world model, e.g. Duda & Hart.
		3. Procedural Knowledge, e.g. Hewitt & Winograd.
		4. Semantic knowledge, e.g. Wilks & Schank.
		5. Formal Logic models, e.g. McCarthy & Hayes.
		6. Syntactic models.
{|;λ30;JUFA}
	Perhaps the best alternative  to a 3-D geometric model  is to
have  a library  of little  2-D images  describing the  appearance of
various 3-D loci from given directions.  The advantage would  be that
a sophisticated image predictor would not be required; on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes cannot be anticipated.  A second alternative is the statistical
world model used in the pattern recognition paradigm.
Such modeling might be added to the geometric model; however, on its own
the statistical abstraction of world features in the presence
of occultation,  rotation and illumination seems as hopeless
as the abstraction of a man's personality from the 
pattern of tea leaves in his cup.

	Procedural knowledge models  represent the world in  terms of
routines (or actors) which either know or can compute the answer to a
question about the world.  Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the  world in terms of first order predicate calculus or in
terms of a  situation calculus. The  procedural, semantic and  formal
logic world  models are all general enough  to represent a
vision model and in a theoretical sense they are merely other notations
for  3-D  geometric modeling.    However  in  practice,  these  three
modeling   regimes  are  not   efficient  holders   and  handlers  of
quantitative geometric  data; but are  rather intended  for a  higher
level  of  abstract reasoning.  Another  alleged  advantage of  these
higher  models  is  that they  can  represent  partial  knowledge and
uncertainty,  which  in  a  geometric  model  is   implicit, in  that
structures are missing or incomplete.  For example, McCarthy and
Feldman demand that when a robot has only seen the front of an office
desk it should be able to draw inferences from its model about the back
of the desk; I feel that this so-called advantage is not required by
the problem and that basic visual modeling is on a more agnostic
level.

	The syntactical  approach to  descriptive vision  is that  an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of a sequence of grammar
transformation rules.  Again, this paradigm is valid in principle but
impractical   for  real   images  of   3-D  objects   because  simple
replacement rules  cannot readily  express rotation,   perspective,
and photometric  transformations. On the other  hand, the syntactical
model has been used to describe perfect line drawings of 3-D objects
(Gips 74).

	The intuitive arguments  include the opinions  that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research. Against such intuitions, I wish to pose
two fallacies. First,  there is the natural mimicry  fallacy, which is
that it  is false to insist that a machine must mimic nature in order
to achieve  its  design  goals. Boeing  747's  are not  covered  with
feathers;  trucks do  not  have legs;  and computer  vision  need not
simulate human vision.   The advocates of  the uniqueness of  natural
intelligence  and perception  will  have to  come  up with  a  rather
unusual  uniqueness  proof to  establish  their conjecture.    In the
meantime, one  should be  open  minded about  the potential  forms  a
perceptive consciousness can take.

	Second,  there is  the self  introspection fallacy,  which is
that  it is false  to insist  that one's introspections  about how he
thinks and  sees are  direct observations  of thought  and sight.  By
introspection  some conclude that  the visual  models (even on  a low
level) are essentially qualitative rather than quantitative.  My belief
is that the vision processing of the  brain is quite quantitative and
only passes into qualities at a higher level of processing. In either
case, the exact details  of human visual processing are  inaccessible
to conscious self introspection.

	Although describing  the above two  fallacies might  soften a
person's  prejudice  against  numerical  geometric  modeling,    some
important argument or idea is missing that would be  convincing short
of the  final achievement of  computer vision. Contrariwise,   I have
not heard an argument that would change my prejudice in favor of such
models.  Nevertheless, beyond prejudice, my theory would be proved
wrong if a really powerful computer vision system is ever built
without using any geometric models worth speaking of, perhaps by
employing an elaborate stimulus-response paradigm.

⊂6.4	Mobile Robot Vision.⊃

	The elements  discussed so far  will now be  brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram in Box 6.7.  (The
diagram is called a mandala in that
a <mandala> is any circle-like system diagram.)  Although the robot
chauffeured cart was the main task theme for this research, I have
failed to date (August 1974) to achieve the hardware and software
required to drive the cart around the laboratory under its own
control.  Nevertheless, this necessarily theoretical cart system has
been of  considerable  use  in developing  the  visual  3-D  modeling
routines and theory, which are the subject of this thesis.
{|;JV;FA}
BOX 6.7{JC} CART VISION MANDALA.
{W300;λ4;F2}
 →→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
 ↑	               WORLD         SIMULATOR         WORLD      ↓
 ↑  								  ↓
 ↑								  ↓
 ↑                   PERCEIVED →→→→→→  CART →→→→→→→→ PREDICTED →→→↓
 ↑	            CAMERA LOCUS      DRIVER        CAMERA LOCUS  ↓
 ↑	                ↑		↓		   	  ↓
 ↑	                ↑		↓		   	  ↓
 ↑                      ↑	      THE CART	     PREDICTED→→→→↓
BODY                 CAMERA			     SUN LOCUS 	  ↓
LOCUS		     LOCUS				 	  ↓
SOLVER		     SOLVER				          ↓
 ↑			↑				          ↓
 ↑			↑			 	          ↓
REVEAL 	             VERIFY				       IMAGE  
COMPARE		     COMPARE				 SYNTHESIZER
 ↑   ↑	 	      ↑   ↑				          ↓
 ↑   ↑                ↑   ↑ 				          ↓
 ↑   ←←	PERCEIVED→→→→→↑   ↑←←←←←←←←←←←←←←←←←←←←	PREDICTED  ←←←←←←←↓
 ←←←←← MOSAIC IMAGE			      MOSAIC IMAGE        ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑              ↓
	PERCEIVED			        PREDICTED         ↓
      CONTOUR IMAGE			      CONTOUR IMAGE       ↓
	   ↑					   ↑ 	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	PERCEIVED				PREDICTED ←←←←←←←←←
       VIDEO IMAGE			       VIDEO IMAGE
	   ↑
	   ↑
	   ↑
       TELEVISION
	 CAMERA

{|;λ30;JUFA}
	The   robot   chauffeur   task   involves   establishing   the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route  is assumed to be clear, and  the cart and the
sun  are assumed  to be the  only movable  things in  a static world.
Dealing with moving obstacles is a second problem; motion thru a
static world must be dealt with first.

	The cart  at the Stanford  Artificial Intelligence Laboratory
is intended for outdoor use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries,  a television
camera,   a television transmitter, a box of  digital logic, a box of
relays,   and a  toy airplane  radio receiver.    (The vehicle  being
discussed is  not "Shaky",   which belongs  to the  Stanford Research
Institute's  Artificial Intelligence Group.  There  are two A.I. labs
near Stanford and  each has a  computer controlled vehicle.) The  six
possible cart actions are: run forwards,  run backwards, steer to the
left,  steer to the right, pan camera to the left,  pan camera to the
right.   Other than  the television  camera,   there is no  telemetry
concerning the state of the cart or its immediate environment.
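
	Since the television camera is the only source of feedback,
the predicted cart locus can only be advanced between pictures by
dead reckoning over these six commands.  A sketch of such an update
is given below (in Python); the state variables, step size and turn
increment are hypothetical rather than measured cart parameters, and
steering is caricatured as an instantaneous change of heading.

import math

def dead_reckon(locus, command, step=1.0, turn=math.radians(5)):
    x, y, heading, pan = locus
    if command == "forward":
        x, y = x + step * math.cos(heading), y + step * math.sin(heading)
    elif command == "backward":
        x, y = x - step * math.cos(heading), y - step * math.sin(heading)
    elif command == "steer left":
        heading += turn
    elif command == "steer right":
        heading -= turn
    elif command == "pan left":
        pan += turn
    elif command == "pan right":
        pan -= turn
    return (x, y, heading, pan)

locus = (0.0, 0.0, 0.0, 0.0)                     # x, y, heading, camera pan
for command in ["forward", "steer left", "forward"]:
    locus = dead_reckon(locus, command)
print(locus)                                     # about (2.00, 0.09, 0.09, 0.0)
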
{|;λ10;JAFA}
BOX 6.8 {JC} A POSSIBLE CART TASK SOLUTION.
	 	1. Predict (or retrieve) 2-D image features.
		2. Perceive (take) a television picture and convert into features.
		3. Compare (verify)  predicted and perceived features.
		4. Solve for camera locus.
		5. Servo the cart along its intended course.
{|;λ30;JUFA}
	The solution to the cart problem begins with the cart at a
known  starting position  with a  road map  of visual  landmarks with
known loci. That is,  the upper leftmost  two rectangles of the  cart
mandala  are initialized  so that  the perceived  cart locus  and the
perceived world correspond with  reality.  Flowing across  the top of
the mandala, the cart driver blindly moves the cart forward along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus.  The  reality
simulator is  an identity in  this simple case  because the  world is
assumed static.  Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing  the landmark
features  expected to  be in  view.  Now, in  the lower  left of  the
mandala,  the cart's television camera takes  a perceived picture and
(flowing upwards) the picture  is converted into a form  suitable for
comparing and  matching with the  predicted image. Features  that are
both predicted  and perceived  and found  to match  are used  by  the
camera locus  solver to compute  a new  perceived camera locus  (from
which  the cart locus can  be deduced). Finally the  cart driver compares
the perceived and  the predicted cart locus  and corrects its  course
and moves the cart again, and so on.
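
	The course correcting step (step 5 of Box 6.8) can be
caricatured as follows (in Python, with a hypothetical waypoint,
tolerance and command names): the driver compares the bearing toward
the next point of the intended route with the heading implied by the
solved cart locus, and chooses a steering command.

import math

def servo(locus, waypoint, tol=math.radians(10)):
    x, y, heading = locus
    bearing = math.atan2(waypoint[1] - y, waypoint[0] - x)
    error = (bearing - heading + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi)
    if error > tol:
        return "steer left"
    if error < -tol:
        return "steer right"
    return "forward"

# the solved locus says the cart has drifted to the right of its intended course
print(servo((5.2, -1.5, 0.0), (10.0, 0.0)))      # -> "steer left"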

	The remaining limb of the cart mandala is invoked in order to
turn the  chauffeur into an explorer.   Perceived images are compared
in time by  the reveal compare  and new features  are located by  the
body locus solver and placed into the world model. The generality and
feasibility  of such  a cart  system depends  almost entirely  on the
representation of the world and the representation of image features.
(The  more general,   the less  feasible). Four smaller  cart systems
might be possible using simpler 3-D models.

	A first system might consist of a road map, a road model, a
road model generator, a solar ephemeris, an image predictor, an
image comparator, a camera locus solver, and a course servo routine.
The roadways and nearby environs are entered into the computer.  In
fact, real roadways are constructed from a two dimensional (X,Y)
alignment map showing where the center of the road goes as a curve
composed of line segments and circular arcs; and from a two
dimensional (S,Z) elevation diagram, showing the height of the road
above sea level as a function of distance along the road in a
sequence of linear grades and vertical arcs which (not too
surprisingly) are nearly cubic splines.  A second version might be made
like the first except that the road model, road model generator, and
image predictor are replaced by a library of road images.  In this
system the robot vehicle is trained by being driven down the roads it
is supposed to follow.  A third system also might be made like the
first except that the road map is not initially given, and indeed the
road is no  longer presumed to  exist.  Part  of the problem  becomes
finding a road,  a road in  the sense of  a clear area;  this version
yields the cart explorer and if the clear area is found quite rapidly
and the  world is updated  quite frequently,  the explorer  can be  a
chauffeur that can handle obstacles and moving objects.
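
	A sketch of the first system's two part road representation is
given below (in Python, with a hypothetical sample road and routine
names): the (X,Y) alignment is a list of straight segments and
circular arcs, the (S,Z) elevation is a piecewise linear profile
(vertical arcs are omitted for brevity), and the road locus is found
by walking the alignment a given distance S.

import math

# alignment: ("line", length) or ("arc", length, curvature = 1/radius)
alignment = [("line", 50.0), ("arc", 31.4159, 1 / 20.0), ("line", 30.0)]
# elevation: (s, z) breakpoints joined by linear grades
elevation = [(0.0, 10.0), (50.0, 12.0), (111.4, 12.0)]

def locus_at(s):
    x, y, heading = 0.0, 0.0, 0.0
    for kind, length, *rest in alignment:
        d = min(s, length)
        if kind == "line":
            x, y = x + d * math.cos(heading), y + d * math.sin(heading)
        else:
            k = rest[0]                          # circular arc of curvature k
            x += (math.sin(heading + k * d) - math.sin(heading)) / k
            y += (math.cos(heading) - math.cos(heading + k * d)) / k
            heading += k * d
        s -= d
        if s <= 0.0:
            break
    return (x, y, heading)

def height_at(s):
    for (s0, z0), (s1, z1) in zip(elevation, elevation[1:]):
        if s <= s1:
            return z0 + (z1 - z0) * (s - s0) / (s1 - s0)
    return elevation[-1][1]

print(locus_at(60.0), height_at(60.0))           # ten units into the curve, at height 12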

⊂6.5	Summary and Related Vision Work.⊃

	To recapitulate, three vision system design requirements were
postulated: reality,  generality,  and continuity. These requirements
were illustrated  by discussing  a number  of  vision related  tasks.
Next, a vision  system was described as mediating  between 2-D images
and  a world model;  with the  world model being  further broken down
into a  3-D geometric  model and a  task world  model. Between  these
entities  three  basic  vision  modes were  identified:  recognition,
verification and revelation (description).  Finally, the general
purpose vision system was depicted as a quantitative and description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher qualitative, symbolic, and recognition oriented task
processors.  Approaching the vision system in greater detail, the roles
of seven (or so) essential kinds of processors were explained: the
task processor, 3-D modeling routines, reality simulator, image
analyser,   image synthesizer, comparators,   and locus  solvers. The
processors and  data  types  were assembled  into  a  cart  chauffeur
system.

	Larry Roberts is  justly credited for doing the  seminal work
in  3-D Computer  Vision; although his  thesis (Roberts  63) appeared
over ten years ago, the subject has languished, dependent on and
overshadowed by the four areas called: Image Processing, Pattern
Recognition,   Computer  Graphics,     and  Artificial  Intelligence.
Outside the computer  sciences, workers in psychology,  neurology and
philosophy also seek a theory of vision.

	Image  Processing  involves  the  study  and  development  of
programs that enhance,  transform and compare 2-D images.  Nearly  all
image processing work can eventually be applied to computer vision in
various circumstances. A survey of this field can be found in an
article by  Rosenfeld(69).   Image Pattern  Recognition involves  two
steps: feature  extraction and classification.   A comprehensive text
about this field with respect to computer vision has been written by
(Duda and Hart 73).  Computer Graphics is the  inverse of descriptive
computer vision.  The problem of computer graphics is to synthesize
images from  three dimensional  models;  the problem  of  descriptive
computer vision is  to analyze images into three  dimensional models.
An introductory  text book about this field  would be that of (Newman
and Sproull 73). Finally, there is Artificial Intelligence,  which in
my opinion is an institution sheltering a heterogeneous group of
embryonic  computer subjects; the biggest of  the present day orphans
include: robotics,    natural language,    theorem proving,    speech
analysis, vision and planning.  A more narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system.

	The related vision work of specific individuals has already
been mentioned in context.  To summarize, the present vision work is
related to the early work of Roberts(63) and Sutherland; to recent
work at Stanford: Falk, Feldman and Paul(67), Tenenbaum(72),
Agin(72), Grape(73); to the work at MIT: Guzman, Horn, Waltz,
Krakauer; to the work at the University of Utah: Warnock, Watkins;
and to work at other places: SRI and JPL. Future progress in computer
vision will proceed in  step with better  computer hardware,   better
computer  graphics  software, and  better  world  modeling  software.
Further vision work at Stanford, which is related to the present
theory, is being done by Lynn Quam and Hans Moravec.  The machine
assembly task is being pursued both by the Artificial Intelligence
Group of the Stanford Research Institute and by the Hand Eye Project
at Stanford University.  Because the demand for doing practical
vision tasks can be satisfied with existing ad hoc methods or by not
using a visual sensor at all, little or no theoretical vision
progress will necessarily result from the achievement of spectacular
robotic industrial assembly demonstrations (hire the handicapped:
blind robots assemble widgets).  On the other hand, since the missing
ingredient for computer vision is the spatial modeling to which
perceived images can be related, I believe that the development of the
technology for generating commercial film and television by computer
for entertainment might make a significant contribution to computer
vision.
{L0,-400;H2;X0.6;*HORNY6;}